[iris] Consolidate controller store layer around process-scoped stores by rjpower · Pull Request #4836 · marin-community/marin

rjpower · 2026-04-16T18:46:54Z

Summary

Reshape the controller's persistence layer so every query/mutation flows through process-scoped store objects instead of ad-hoc inline SQL. Four consolidations land together because they share caches, call sites, and invariant boundaries.

§1 — Endpoint cache becomes a store

EndpointRegistry moves from its own module into store.py as EndpointStore, reusing the same on_commit write-through hook pattern as the other stores. endpoint_registry.py is deleted.

§2 — Stores are persistent and owned by ControllerDB

Stores are no longer reconstructed per-transaction. Each is built once in ControllerDB.__init__ and reused across all transactions, with caches living on the store instance. ControllerDB exposes tasks, jobs, workers, dispatch, meta, users, reservations, endpoints properties.

§3 — DbContext bundles cursor + stores

New DbContext / DbReadContext frozen dataclasses bundle a cursor with every store. Transition helpers that used to take a raw cursor now take ctx: DbContext, so callers can reach any store by name without plumbing. ControllerDB.transaction() is replaced by transact() / read_ctx().

§4 — TaskFilter reifies the "simple task read" query shape

28 inline SELECT FROM tasks sites across transitions.py collapse into self._db.tasks.query(ctx.cur, TaskFilter(...)) with chunked IN-list handling for the SQLite host-parameter cap and optional jobs / job_configs joins. EndpointQuery moves to store.py alongside it.

Net: +779 / -2338 LOC.

Test plan

uv run pytest lib/iris/tests/cluster/controller/test_task_store_query.py -x — 17 new tests covering TaskFilter validation, empty-list short-circuit, chunking + limit interaction, and the join variants
Full lib/iris test suite: 793 passing
./infra/pre-commit.py --all-files --fix clean (ruff, black, pyrefly)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe79b17576

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-16T18:53:55Z

+            "UPDATE task_attempts SET state = ?, "
+            "finished_at_ms = COALESCE(finished_at_ms, ?), error = NULL "
+            "WHERE task_id = ? AND attempt_id = ?",
+            (int(r.attempt_state), r.now_ms, r.task_id, r.attempt_id),


Preserve attempt error and exit code during requeue

This requeue path clears task_attempts.error and never records exit_code, so retriable FAILED/WORKER_FAILED updates lose the diagnostics for the finished attempt. Before this refactor, the transition path persisted attempt-level error/exit metadata even when the task was moved back to PENDING; now tasks_for_job_with_attempts and related dashboard/debug flows can no longer show why the previous attempt failed.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-16T18:53:55Z

+        self._endpoint_store._load_all()
+        self._worker_store.reset_attr_cache()


Clear job-config cache after replace_from restore

replace_from() reloads endpoint and worker caches after swapping SQLite files, but it does not invalidate the new process-scoped JobStore config cache. If any job config was cached before restore, later reads can keep using stale config rows from the old DB, which can skew post-restore transition logic (timeouts/resources/retry policy) until process restart.

Useful? React with 👍 / 👎.

Phase 1 + Phase 3 follow-up to the store consolidation (PR #4836). Phase 1 — dead code and correctness: - Remove ControllerDB.read_ctx()/DbReadContext (0 callers). - Remove dead NotImplementedError branch in _filter_id_chunks for TaskFilter.task_ids + job_ids (no caller uses both). - Delete WorkerStore._attr_cache. Post-denormalization worker_attributes has a compound PK and <100 healthy-active workers per tick — an indexed read is cheaper than the in-memory mirror, and the old cache was buggy anyway (prune_old_data skipped invalidation, mutations happened outside on_commit). Rewrite healthy_active_with_attributes to read attributes in the same read snapshot. - Fix JobStore.get_config negative-cache pinning: don't cache misses. Phase 3 — structural simplification: - Delete MetaStore. The two last_submission_ms accessors become ControllerDB.get_counter/set_counter beside the existing next_sequence helper. The meta table itself stays (needed for priority_insertion sequencing and scheduler sort order). - Delete PruneStore. Redistribute to the stores that own the tables: prune_task_resource_history → TaskStore; prune_per_worker_history(table=...) → WorkerStore.prune_task_history + WorkerStore.prune_resource_history (split by table, no string dispatch); get_inactive_worker_before → WorkerStore; get_finished_jobs_before → JobStore; batch_delete → module-level helper in db.py. - Fold ReservationStore into DispatchStore (one method, moved replace_claims and converted the per-row INSERT loop to executemany while there). Net: -170 lines. Tests: 792 passing in lib/iris/tests/cluster/controller.

Restructure the controller storage layer into three distinct layers with clear ownership boundaries: - db.py: pure SQLite infrastructure (TransactionCursor, QuerySnapshot, read pool). No domain knowledge; no store ownership. - schema.py: DDL (Table/Column) plus row dataclasses. Introduces a @Projection decorator and pcolumn() field helper that derive the column→dataclass mapping from the dataclass definition itself, eliminating ~160 lines of hand-written decode boilerplate and unifying the three TaskDetailRow query variants on one decoder. - store.py: per-table domain stores (TaskStore / JobStore / WorkerStore / EndpointStore / DispatchStore / UserStore) plus ControllerStores, the process-scoped container. `ControllerStores.from_db(db)` is the single entry point; `transact()` and `read()` hand out a per-scope ControllerStore bundling a Cursor and the stores, and the caller can pass either `ctx` (an object with `.cur` and the stores) or `ctx.cur` into store methods. Read scope hands out a read-only connection, so any write raises OperationalError — locked in by a regression test. The inversion (stores own db, not the other way around) removes the prior circular setup where ControllerDB constructed and attached its stores. Stores no longer carry `self._db`; every method takes an explicit cursor. The `JobStore.attach_tasks` fixup for the TaskStore ↔ JobStore cycle is replaced by passing `tasks: TaskStore` into the two methods that need it. Deletes that fall out of the refactor: - endpoint_registry.py: folded into EndpointStore - constraints.py: unused - MetaStore / PruneStore / ReservationStore + the worker attribute cache: all dead or better expressed as plain store methods - test_reservation_holder_reset_regression.py: guarded a cache that no longer exists Tier-1 cleanups landed alongside: - sql_placeholders(n) helper (24 sites) - WhereBuilder for building optional AND-joined predicates (3 sites) - AttemptFinalizer.build(...) classmethod (7 of 8 sites) - Test helpers (`*_for_test` methods) moved from ControllerTransitions into tests/cluster/controller/_testing.py where they belong. ControllerTransitions, ControllerServiceImpl, ActorProxy, the dashboard, and the benchmark script are all rewired to take ControllerStores instead of ControllerDB. Tests updated accordingly; the remaining `state._db` references are intentional (infra calls like `db.snapshot()` / `db.wal_checkpoint()`).

rjpower added the agent-generated Created by automation/agent label Apr 16, 2026

chatgpt-codex-connector Bot reviewed Apr 16, 2026

View reviewed changes

rjpower requested a review from ravwojdyla April 16, 2026 23:25

rjpower force-pushed the claude/busy-beaver branch from 417dbb9 to c1a79e1 Compare April 17, 2026 20:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Consolidate controller store layer around process-scoped stores#4836

[iris] Consolidate controller store layer around process-scoped stores#4836
rjpower wants to merge 1 commit intomainfrom
claude/busy-beaver

rjpower commented Apr 16, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 16, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		self._endpoint_store._load_all()
		self._worker_store.reset_attr_cache()

Conversation

rjpower commented Apr 16, 2026

Summary

§1 — Endpoint cache becomes a store

§2 — Stores are persistent and owned by ControllerDB

§3 — DbContext bundles cursor + stores

§4 — TaskFilter reifies the "simple task read" query shape

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant